Portable high-performance programs

Author

  • Matteo Frigo
Abstract

This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures. To cope with parallelism portably in high-performance programs, we present the Cilk multithreaded programming system. In the Cilk-5 system, parallel programs scale up to run efficiently on multiple processors, but unlike existing parallel-programming environments, such as MPI and HPF, Cilk programs “scale down” to run on one processor as efficiently as a comparable C program. The typical cost of spawning a parallel thread in Cilk-5 is only between 2 and 6 times the cost of a C function call. This efficient implementation was guided by the work-first principle, which dictates that scheduling overheads should be borne by the critical path of the computation and not by the work. We show how the work-first principle inspired Cilk’s novel “two-clone” compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler. To cope portably with the memory hierarchy, we present asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$, where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$.
We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults. To attain portability in the face of both parallelism and the memory hierarchy at the same time, we examine the location consistency memory model and the BACKER coherence algorithm for maintaining it. We prove good asymptotic bounds on the execution time of Cilk programs that use location-consistent shared memory. To cope with the diversity of processor architectures, we develop the FFTW self-optimizing program, a portable C library that computes Fourier transforms. FFTW is unique in that it can automatically tune itself to the underlying hardware in order to achieve high performance. Through extensive benchmarking, FFTW has been shown to be typically faster than all other publicly available FFT software, including codes such as Sun’s Performance Library and IBM’s ESSL that are tuned to a specific machine. Most of the performance-critical code of FFTW was generated automatically by a special-purpose compiler written in Objective Caml, which uses symbolic evaluation and other compiler techniques to produce “codelets”, optimized sequences of C code that can be assembled into “plans” to compute a Fourier transform. At runtime, FFTW measures the execution...
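
The cache-oblivious idea described in the abstract can be illustrated with a small divide-and-conquer matrix transpose. The following is only a minimal sketch, not code from the dissertation: the names `co_transpose` and `transpose_rec` are hypothetical, the matrices are assumed to be stored in row-major order in separate (non-overlapping) arrays, and the recursion goes all the way down to single elements for clarity, whereas practical code would likely stop at a small base case to reduce call overhead. The point it shows is that no cache size or cache-line length appears anywhere in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose the rows x cols sub-block of A whose top-left corner is
 * (r0, c0). A is m x n and B is n x m, both row-major (an assumption
 * of this sketch). The larger dimension is halved at each step, so
 * sub-blocks eventually fit in any level of cache without the code
 * knowing the cache parameters. */
static void transpose_rec(const double *A, double *B,
                          size_t m, size_t n,
                          size_t r0, size_t rows,
                          size_t c0, size_t cols)
{
    if (rows == 0 || cols == 0)
        return;                                 /* empty block */
    if (rows == 1 && cols == 1) {
        B[c0 * m + r0] = A[r0 * n + c0];        /* B[c][r] = A[r][c] */
    } else if (rows >= cols) {
        size_t half = rows / 2;                 /* split the rows */
        transpose_rec(A, B, m, n, r0, half, c0, cols);
        transpose_rec(A, B, m, n, r0 + half, rows - half, c0, cols);
    } else {
        size_t half = cols / 2;                 /* split the columns */
        transpose_rec(A, B, m, n, r0, rows, c0, half);
        transpose_rec(A, B, m, n, r0, rows, c0 + half, cols - half);
    }
}

/* Out-of-place transpose of the m x n matrix A into the n x m matrix B. */
void co_transpose(const double *A, double *B, size_t m, size_t n)
{
    assert(A != B);                             /* sketch is out-of-place only */
    transpose_rec(A, B, m, n, 0, m, 0, n);
}
```

Calling `co_transpose(A, B, 2, 3)` on a 2 × 3 matrix fills the 3 × 2 matrix `B`. The same source runs unchanged on machines with different cache sizes, which is the portability property the abstract emphasizes.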

Similar resources

Parallel computing using MPI and OpenMP on self-configured platform, UMZHPC.

Parallel computing is a topic of interest for a broad scientific community since it facilitates many time-consuming algorithms in different application domains. In this paper, we introduce a novel platform for parallel computing by using MPI and OpenMP programming languages based on a set of networked PCs. UMZHPC is a free Linux-based parallel computing infrastructure that has been developed to cr...

Achieving Performance Portability with SKaMPI for High-Performance MPI Programs

Current development processes for parallel software often fail to deliver portable software. This is because these processes usually require a tedious tuning phase to deliver software of good performance. This tuning phase is often costly and results in machine-specific (i.e., less portable) software. Designing software for performance and portability in early stages of software design re...

Compiling High Performance Fortran to Message Passing

ADAPTOR is a public domain High Performance Fortran compilation system that provides the comfortable data parallel programming paradigm on parallel machines with distributed memory. Therefore, the data parallel programs with their global view of data are translated to programs that work on the local parts of the distributed data and exchange the other needed data via message passing. This paper...

Performance Visualisation in a Portable Parallel Programming Environment

In order to obtain the highest possible performance from programs running on massively parallel machines it is essential to identify precisely where and when computational resources are consumed during their execution. A number of performance visualisation tools have evolved to meet this need for particular systems but they are often not portable to other machines. We regard portability as cruc...

OMPC++: A Portable High-Performance Implementation of DSM using OpenC++ Reflection

Platform portability is one of the most demanded properties of a system today, due to the diversity of runtime execution environments in wide-area networks, and parallel programs are no exception. However, parallel execution environments are very diverse and can change dynamically, while performance must be portable as well. As a result, techniques for achieving platform portability are someti...

High-Level Portable Programming Language for Optimized Memory Use of Network Processors

Network processors (NPs) are widely used for programmable and high-performance networks; however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardwar...

Publication date: 1999